Table of Contents


Libraries needed for this notebook

Question - 1

Solution:

Theoretical Solution:

Plotting the derivative function

Numerically finding the roots of derivative function

Conclusion

Some visualisations:

Plot of area with respect to time

Plot of triangles for different values of roots of the derivative function

An animation to visualise the triangle and its area w.r.t. time

Question - 2

Solution

Data generation for this toy problem

A basic linear regression model

Regularized linear regression model (RIDGE)

Another regularised model (LASSO) for sparse solution

Question - 3

Solution

Simulation Results

Plotting SR w.r.t $\rho$ for different $f-$values

Question - 4

Solution:

Simulation results

Simulation of the probability of "Getting a three"

Simulation of the probability of "Getting a four"

Question - 5

Solution

Downloading data from Yahoo

Check for the missing values

Winsorize data

Plot of Stock prices over time

Function to calculate daily returns based on moving average window

Covariance matrix of asset returns

Converting the winsorized data to be used with the ZIPLINE trading algorithm

Markowitz Portfolio Optimization

Running Trading Algorithm using Zipline

Plot of the Portfolio value if we start with $100K

Resulting PnL of this trading algorithm

Extract Maximum Draw Down and plotting

Question - 6

Solution

Simulation results

Question - 7

Simulation results

Function to throw a dice until the sum exceeds a given threshold

Function to simulate the number of draws needed to exactly match a given threshold

Function to calculate the probability of exactly hitting a given threshold given that the sum is $\geq$ threshold. (This probability can be ignored)

Visualisation for the number of draws needed to exactly hit a given threshold

A theoretical solution based on simulation

Libraries needed for this notebook

In [0]:
# !pip install zipline 
# !pip install pandas_datareader 
# !pip install fix_yahoo_finance
# !pip install cvxopt
# !pip install yahoo-finance

Question - 1

Three points ABC are moving around a unit circle with periods of $1s, 2s$ and $3s$. They all start from the same position on the circle at time $t=0$. Solve for the time and the value when the maximal area of the triangle ABC is reached.

Solution:

Theoretical Solution:

We know that an equilateral triangle has the maximum area among all triangles inscribed in a circle. In this situation, at a given time $t$ the vertices of the triangle are given by $$\left(cos\left(\frac{2\pi t}{1}\right), sin\left(\frac{2\pi t}{1}\right)\right), \left(cos\left(\frac{2\pi t}{2}\right), sin\left(\frac{2\pi t}{2}\right)\right), \left(cos\left(\frac{2\pi t}{3}\right), sin\left(\frac{2\pi t}{3}\right)\right).$$ If $(x_1,y_1), (x_2, y_2) $ and $(x_3,y_3)$ denote the vertices of a triangle then its (signed) area is given by $0.5 \times [(x_2−x_1)(y_3−y_1)−(x_3−x_1)(y_2−y_1)]$. Therefore the area equals $$0.5\times \left[ \left(cos\left(\frac{2\pi t}{2} \right)-cos\left(\frac{2\pi t}{1}\right) \right) \left(sin\left(\frac{2\pi t}{3} \right)-sin\left(\frac{2\pi t}{1}\right) \right) - \left(cos\left(\frac{2\pi t}{3} \right)-cos\left(\frac{2\pi t}{1}\right) \right) \left(sin\left(\frac{2\pi t}{2} \right)-sin\left(\frac{2\pi t}{1}\right) \right) \right].$$ Using the identities $$sinA−sinB=2cos((A+B)/2) sin((A−B)/2) \mbox{ and } cosA−cosB=−2sin((A+B)/2) sin((A−B)/2),$$ we compute the individual differences in the vertex notation: $$x_2 - x_1 = cos\left(\frac{2\pi t}{2}\right) - cos\left(\frac{2\pi t}{1}\right) = 2sin\left(2\pi t\frac{3}{4} \right) sin\left(2\pi t\frac{1}{4} \right),$$ $$y_3 - y_1 = sin\left(\frac{2\pi t}{3}\right) - sin\left(\frac{2\pi t}{1}\right) = -2cos\left(2\pi t\frac{4}{6} \right) sin\left(2\pi t\frac{2}{6} \right),$$ $$x_3 - x_1 = cos\left(\frac{2\pi t}{3}\right) - cos\left(\frac{2\pi t}{1}\right) = 2sin\left(2\pi t\frac{4}{6} \right) sin\left(2\pi t\frac{2}{6} \right),$$ $$y_2 - y_1 = sin\left(\frac{2\pi t}{2}\right) - sin\left(\frac{2\pi t}{1}\right) = -2cos\left(2\pi t\frac{3}{4} \right) sin\left(2\pi t\frac{1}{4} \right).$$ Therefore, twice the area at time $t$ is given by:

$$2A(t) = -4 sin\left(2\pi t\frac{1}{4} \right)sin\left(2\pi t\frac{2}{6} \right)\left[ sin\left(2\pi t\frac{3}{4} \right) cos\left(2\pi t\frac{4}{6} \right) - sin\left(2\pi t\frac{4}{6} \right) cos\left(2\pi t\frac{3}{4} \right)\right] \\ = -4 sin\left(2\pi t\frac{1}{4} \right)sin\left(2\pi t\frac{1}{3} \right) sin\left(2\pi t\frac{1}{12} \right) \\ =\frac{-4}{2} \left[ cos\left(2\pi t\frac{1}{12}\right) - cos\left(2\pi t\frac{7}{12} \right) \right] sin\left(2\pi t\frac{1}{12} \right) \\ = -2 cos\left(2\pi t\frac{1}{12} \right) sin\left(2\pi t\frac{1}{12} \right) + 2 cos\left(2\pi t\frac{7}{12} \right) sin\left(2\pi t\frac{1}{12} \right) \\ = -sin\left(2\pi t\frac{1}{6} \right) + \frac{2}{2} \left[ sin\left(2\pi t\frac{8}{12} \right) - sin\left(2\pi t\frac{6}{12} \right) \right].$$
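This reduction can be sanity-checked numerically. The short sketch below (written in Python 3 style, unlike the Python 2 cells in this notebook, and independent of them) compares the signed doubled area computed directly from the vertices with the final closed form $-sin(2\pi t/6) + sin(4\pi t \cdot 2/3) - sin(\pi t)$:

```python
import numpy as np

def doubled_signed_area(t):
    # vertices at time t for the three periods 1, 2 and 3
    xs = [np.cos(2 * np.pi * t / p) for p in (1, 2, 3)]
    ys = [np.sin(2 * np.pi * t / p) for p in (1, 2, 3)]
    # shoelace expression for twice the signed area
    return (xs[1] - xs[0]) * (ys[2] - ys[0]) - (xs[2] - xs[0]) * (ys[1] - ys[0])

def closed_form(t):
    # 2*A(t) after the product-to-sum reductions
    return -np.sin(2 * np.pi * t / 6) + np.sin(4 * np.pi * t / 3) - np.sin(np.pi * t)

# the two expressions agree over a full 12-second grid
for t in np.linspace(0.0, 12.0, 121):
    assert abs(doubled_signed_area(t) - closed_form(t)) < 1e-9
```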

In order to find the maximum area, we find the derivative of the area w.r.t $t$, which yields:

$$\frac{d\,2A(t)}{dt} = -\frac{2\pi}{6}cos\left(2\pi t\frac{1}{6} \right) + \frac{4\pi}{3}cos\left(2\pi t\frac{2}{3} \right) - \frac{2\pi}{2}cos\left(2\pi t\frac{1}{2} \right) = 0,$$ which, after dividing through by $\pi$ (the constant factors do not affect the roots), becomes $$-\frac{1}{3}cos\left(2\pi t\frac{1}{6} \right) + \frac{4}{3}cos\left(2\pi t\frac{2}{3} \right) - cos\left(2\pi t\frac{1}{2} \right) = 0.$$

We solve this equation numerically and find its roots below.

Plotting the derivative function

In [133]:
import numpy as np

def derivative_function(t):
  """
  Derivative of the area with respect to time, up to a constant factor
  """
  res = -np.cos(2*np.pi*t/6)/3 + 4*np.cos(4*np.pi*t/3)/3 - np.cos(np.pi*t)
  return res


t = np.linspace(0, 10, 1000)
y = derivative_function(t)  # the function is vectorised, so no explicit loop is needed
  
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

plt.plot(t,y, 'r')
plt.xlabel('Time')
plt.ylabel('Derivative')
Out[133]:
Text(0,0.5,u'Derivative')

Numerically finding the roots of derivative function

In [134]:
# finding the roots of the derivative function 

from scipy import optimize

all_roots = [optimize.brenth(derivative_function, i+0.1, i+1) for i in range(10)]

print all_roots
[0.9672634196852321, 1.7568035472889338, 2.5451617144182412, 3.4548382855816904, 4.243196452711066, 6.0, 6.967263419685233, 7.756803547288934, 8.54516171441824, 9.45483828558169]

Conclusion

We could also compute the second derivative and check that it is negative to identify the local maxima. Instead, I have written a Python function below which simply computes the area of the triangle at a given time instance, and hence we can pick out where the maxima occur. I have also plotted the area of the triangle with respect to time. The time instances where the maxima occur are $t = 2.5451617144182412, 3.4548382855816904, 8.54516171441824, 9.45483828558169, ...$. I have concluded this based on the behaviour of the area with respect to the time variable, which is shown in the visualisations that follow below:

Some visualisations:

In [0]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

def polar_coordinate(t, period):
  """
  Function returns the polar co-ordinate of a point at time t which moves 
  around a circle of radius 1 with period equal to "period"
  """
  res = [np.cos(2*np.pi*t/period), np.sin(2*np.pi*t/period)]
  return res

def vertex_of_triangle_at_time(t, periods = [1,2,3]):
  """
  Function returns the co-ordinates of all three vertices of a triangle 
  where the vertices are given by the position of three particles moving 
  with different periods, at a given time instance.
  """
  res = [polar_coordinate(t, periods[0]), 
         polar_coordinate(t, periods[1]),
         polar_coordinate(t, periods[2])]
  return res

def area_of_triangle_at_time(t):
  """
  Function returns the area of triangle at a given instance t
  """
  # find the vertices of the triangle 
  v = vertex_of_triangle_at_time(t)
  # extract all x and y components 
  x = [v[0][0], v[1][0], v[2][0]]
  y = [v[0][1], v[1][1], v[2][1]]
  # calculate the area 
  area = abs(0.5*((x[1] - x[0])*(y[2]-y[0]) - (x[2] - x[0])*(y[1] - y[0])))
  return area

def draw_triangle_inside_circle(t):
  """
  A visualisation to see the positions of all the three particles around the 
  circle of radius 1 which moves with different periods. 
  """
  circle = plt.Circle((0.0, 0.0), 1, color='blue')
  fig = plt.gcf()
  fig.set_size_inches(6, 6)

  plt.gcf().gca().add_artist(circle)
  plt.xlim(-1.2,1.2)
  plt.ylim(-1.2,1.2)
  v = vertex_of_triangle_at_time(t)
  x = [v[0][0], v[1][0], v[2][0], v[0][0]]
  y = [v[0][1], v[1][1], v[2][1], v[0][1]]
  for i in range(3):
    plt.plot(x[i:i+2], y[i:i+2], 'ro-')
  leg = "Area is: %.2f"%area_of_triangle_at_time(t)
  #print leg
  plt.text(-1,1, leg, color = 'red')
  

Plot of area with respect to time

In [0]:
import numpy as np

def plot_area_vs_time(t = np.linspace(-10, 10, 1000)):
  y = [area_of_triangle_at_time(ti) for ti in t]
  plt.plot(t, y)
  plt.xlabel('Time')
  plt.ylabel('Area')
In [137]:
plot_area_vs_time(np.linspace(0,16,1000))

Plot of triangles for different values of roots of the derivative function

In [138]:
# plot the positions of three particles 

jet= plt.get_cmap('jet')
colors = iter(jet(np.linspace(0,1,10)))

for i in range(len(all_roots)):
  plt.figure()
  draw_triangle_inside_circle(all_roots[i])

An animation to visualise the triangle and its area w.r.t. time

In [139]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation, rc
from IPython.display import HTML

# animate over some set of x, y
patch = plt.Circle((0, 0), 1, fc='y')
# First set up the figure, the axes, and the plot element
fig, ax = plt.subplots()
fig.set_size_inches(6, 6)
plt.close()
ax.set_xlim(( -1.2, 1.2))
ax.set_ylim((-1.2, 1.2))
line1, = ax.plot([], [], lw=2, color = 'red')

line2, = ax.plot([], [], lw=2, color = 'red')
line3, = ax.plot([], [], lw=2, color = 'red')
time_text = ax.text(-1, 1, '', color = 'red')
area_text = ax.text(0.5,1, '', color = 'red')
pt1 = ax.text(0, 0, '', color = 'black', fontsize = 'x-large')
pt2 = ax.text(0, 0, '', color = 'black', fontsize = 'x-large')
pt3 = ax.text(0, 0, '', color = 'black', fontsize = 'x-large')

# initialization function: plot the background of each frame
def init():
    patch.center = (0, 0)
    ax.add_patch(patch)
    return (patch,)
# animation function: this is called sequentially
def animate(i):
  t = 0.01*i
  v = vertex_of_triangle_at_time(t)
  x = [v[0][0], v[1][0], v[2][0], v[0][0]]
  y = [v[0][1], v[1][1], v[2][1], v[0][1]]
  
  line1.set_data([x[0],x[1],x[1],x[2],x[2],x[0]], [y[0],y[1],y[1],y[2],y[2],y[0]])
  area_txt = "Area : %.2f"%area_of_triangle_at_time(t)
  time_txt = "Time : %.2f"%t
  time_text.set_text(time_txt)
  area_text.set_text(area_txt)
  pt1.set_text('1')
  pt1.set_x(x[0])
  pt1.set_y(y[0])
  pt2.set_text('2')
  pt2.set_x(x[1])
  pt2.set_y(y[1])
  pt3.set_text('3')
  pt3.set_x(x[2])
  pt3.set_y(y[2])
  return (line1,time_text,area_text,)
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=350, interval=100, blit=False)
rc('animation', html='jshtml')
anim
Out[139]:



Question - 2

Consider a multivariate linear model $y = X\beta + \epsilon$, where the noise terms are independent. Generate 100 normally distributed points X with the following covariance matrix, and generate y using $\beta = [1, 0.5, 0] $ and $Var(\epsilon) = 0.1$ $$ \left(\begin{array}{ccc} 1 & 0.1 & 0.999 \\ 0.1 & 1 & 0.1 \\ 0.999 & 0.1 & 1 \end{array}\right) $$

  1. Compute an estimate of the $\beta$ vector from the observed $X$ and $y$
  2. Compute the variance of the elements of the estimated $\beta$-vector
  3. How would you regularize this problem and lower this variance? Compute new variances and $\beta$-estimates.
  4. If you would like a sparse solution where not all entries are non-zero, how could you formulate the problem?

Solution

Data generation for this toy problem

In [0]:
# !apt-get update
# !apt-get install r-base
# !pip install rpy2==2.8.6

import random 
import numpy as np

n = 100
# true value of beta
beta = np.array([1,0.5,0] )
# variance - covariance matrix
cov = np.array([[1,0.1,0.999], [0.1, 1,0.1], [0.999,0.1,1]])

# generating the design matrix 
np.random.seed(1)
X = np.random.multivariate_normal(beta, cov, n)

# Generating error terms 
np.random.seed(1)
epsilon = np.random.normal(0, np.sqrt(0.1), n)

# obtain the dependent variable 
y = X.dot(beta) + epsilon

# m = np.apply_along_axis(np.mean, 0, X)

# X = X - m

A basic linear regression model

In [141]:
import statsmodels.api as sm
ols = sm.OLS(y, X)
ols_result = ols.fit()

print ols_result.summary()
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.972
Model:                            OLS   Adj. R-squared:                  0.971
Method:                 Least Squares   F-statistic:                     1112.
Date:                Wed, 25 Jul 2018   Prob (F-statistic):           5.94e-75
Time:                        18:05:30   Log-Likelihood:                -13.130
No. Observations:                 100   AIC:                             32.26
Df Residuals:                      97   BIC:                             40.08
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.0362      0.031     33.589      0.000       0.975       1.097
x2             0.4581      0.031     14.890      0.000       0.397       0.519
x3            -0.0628      0.041     -1.524      0.131      -0.145       0.019
==============================================================================
Omnibus:                        0.777   Durbin-Watson:                   2.046
Prob(Omnibus):                  0.678   Jarque-Bera (JB):                0.357
Skew:                           0.085   Prob(JB):                        0.837
Kurtosis:                       3.238   Cond. No.                         2.94
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/usr/local/lib/python2.7/dist-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
In [142]:
print "The estimated value of coefficients are : ", ols_result.params, "\n \n "

print "The estimated variance-covariance of the beta estimate is : \n \n", ols_result.cov_HC0
The estimated value of coefficients are :  [ 1.03615032  0.45805317 -0.06281392] 
 
 
The estimated variance-covariance of the beta estimate is : 
 
[[ 0.00074481 -0.00017781 -0.00085408]
 [-0.00017781  0.00111013  0.00030155]
 [-0.00085408  0.00030155  0.00185368]]

Regularized linear regression model (RIDGE)

In [143]:
from sklearn.linear_model import Ridge
import numpy as np

ridge = Ridge(alpha= 6.5, fit_intercept=False, normalize=False).fit(X, y)

print "The coefficients of ridge regression are : ", ridge.coef_, "\n"

def get_var_cov_ridge(ridge, X, y):
  """
  Function to compute the variance covariance matrix of a ridge regression model
  
  Input:
  ------------
  ridge : the ridge regression model - a sklearn object 
  X     : the design matrix - a numpy array 
  y     : the dependent variable 
  
  Output:
  ------------
  The variance covariance matrix of the ridge regression model 
  
  Citation:
  ------------
  https://onlinecourses.science.psu.edu/stat857/node/155/
  
  """
  rss = sum((y - ridge.predict(X))**2)
  no_param = len(ridge.coef_)
  N = len(y)
  sigma_sq_hat = rss/(N - no_param)
  XX = X.T.dot(X)
  XX_plus_lamda_I = XX + np.diag([ridge.alpha for i in range(no_param)])
  inverse = np.linalg.inv(XX_plus_lamda_I)
  res = sigma_sq_hat*(np.matmul(np.matmul(inverse, XX), inverse))
  return res

print "The new variance-covariance matrix of ridge model is : \n \n ", get_var_cov_ridge(ridge, X, y)
The coefficients of ridge regression are :  [ 9.73927752e-01  4.53859518e-01 -9.69530654e-04] 

The new variance-covariance matrix of ridge model is : 
 
  [[ 0.00074456 -0.0002664  -0.00064263]
 [-0.0002664   0.00082595  0.00021972]
 [-0.00064263  0.00021972  0.00127064]]

We can observe from the output of the ridge model that $\beta_3$ gets very close to zero, while the variance-covariance matrix shows a decrease in the variances of the estimated $\beta$ coefficients.

In order to get a sparse solution for the estimated beta coefficients, we can use a LASSO model, which adds an L1 penalty term on top of the regular loss function. As we can observe from the output below, the LASSO model sets two of the coefficients exactly to zero.

Another regularised model (LASSO) for sparse solution

In [144]:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=1.0, fit_intercept=False, normalize=False).fit(X, y)

print "The coefficients of the LASSO model is: ", lasso.coef_
The coefficients of the LASSO model is:  [0.59886984 0.         0.        ]

Question - 3

Let $\Sigma = \sigma^2 \left[\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right] $ be a $2\times2$ covariance matrix where $\sigma = 0.3$ is the volatility and $\rho ∈ [−1,1]$ is an unknown correlation. Let $f$ be a $2-d$ known vector specified below. For any number $\lambda > 0$ let $w_f^*(\lambda)$ be a $2-d$ vector solving the problem $\max\{w^T f −\frac{1}{2}\lambda w^T \Sigma w \}$ under the constraint $w^T 1 = 0$. For any $\lambda$ calculate the following quantities: $\mu^* = f^T w^*, \sigma^*{^2} = w^{*^T} \Sigma w^*$ and $SR = \mu^* / \sigma^*$. Does $SR$ depend on $\lambda$? Study the dependence of $SR$ on the correlation parameter $\rho$ for the vectors $f_1, f_2, f_3: f_1 = [1,0]^T, f_2 = [0,1]^T, f_3 = [1, −1]^T$.

Solution

Since we are maximising $g(w) = w^T f −\frac{1}{2}\lambda w^T \Sigma w$ subject to $w^T 1 = w_1 + w_2 = 0$, we can substitute $w_2 = -w_1$. The expression for $g(w)$ simplifies to $$g(w) = w_1f^1 - w_1 f^2 - \frac{\lambda \sigma^2}{2} [w_1, -w_1] \left[\begin{array}{cc} 1 & \rho \\ \rho & 1 \end{array}\right] \left[\begin{array}{c} w_1 \\ -w_1 \end{array}\right],$$ where $f = [f^1, f^2] $ and $w =[w_1, -w_1]$. The matrix multiplication in the expression of $g(w)$ further yields $$g(w) = w_1(f^1 - f^2) - \frac{\lambda \sigma^2}{2}2w_1^2(1-\rho).$$ We take the derivative w.r.t $w_1$ and equate it to zero, $$(f^1 - f^2) - \lambda \sigma^2 2w_1(1-\rho) = 0,$$ which solves to yield $$w_1 = \frac{(f^1 - f^2)}{2 \lambda \sigma^2 (1 -\rho)}.$$ Noting that $\mu^* = f^T w^*$ and $w^* = [w_1, -w_1]^T$, we have $$\mu^* = f^T w^* = w_1(f^1 - f^2) = \frac{(f^1 - f^2)^2}{2 \lambda \sigma^2 (1 -\rho)}.$$ At the same time, $\sigma^*{^2}$ evaluates to $$\sigma^*{^2} = 2w_1^2 \sigma^2 (1-\rho) = \frac{(f^1 - f^2)^2}{2 \lambda^2 \sigma^2 (1 -\rho)}.$$ Therefore $SR$ is given by $$SR = \frac{(f^1 - f^2)^2}{2 \lambda \sigma^2 (1 -\rho)} \cdot \frac{\sqrt{2}\lambda \sigma \sqrt{(1-\rho)}}{(f^1 - f^2)} = \frac{(f^1 - f^2)}{\sigma\sqrt{2(1-\rho)}},$$ which is independent of $\lambda$.
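The $\lambda$-independence is easy to confirm numerically. A small sketch (Python 3 style; $\sigma = 0.3$ is from the problem, while $\rho = 0.5$, $f = f_1$ and the two $\lambda$ values are arbitrary choices) builds $w^*$ from the closed-form $w_1$ and computes $SR$ directly from $\Sigma$:

```python
import numpy as np

sigma, rho = 0.3, 0.5                   # sigma from the problem; rho arbitrary
f = np.array([1.0, 0.0])                # f_1 from the problem
Sigma = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])

def sharpe(lam):
    # closed-form optimum under the constraint w1 + w2 = 0
    w1 = (f[0] - f[1]) / (2 * lam * sigma**2 * (1 - rho))
    w = np.array([w1, -w1])
    mu = f.dot(w)                       # mu* = f^T w*
    sig = np.sqrt(w.dot(Sigma).dot(w))  # sigma* = sqrt(w*^T Sigma w*)
    return mu / sig

print(sharpe(1.0), sharpe(5.0))         # identical: SR does not depend on lambda
```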

Simulation Results

In [0]:
import numpy as np
import pandas as pd

def SR_function(rho, f):
  """
  Function calculates the SR for a given value of rho and f
  input:
  ---------------------------------------
  rho : the value of correlation coefficient
  f   : a bivariate list of f-values 
  """
  # calculate the value of SR
  res = (f[0]-f[1])/(np.sqrt(2*(1-rho)))
  return res
  
# 
rho = np.linspace(-1,0.99,100)
SR_f1 = range(len(rho))
SR_f2 = range(len(rho))
SR_f3 = range(len(rho))

for i in range(len(rho)):
  SR_f1[i] = SR_function(rho[i], [1,0])
  SR_f2[i] = SR_function(rho[i], [0,1])
  SR_f3[i] = SR_function(rho[i], [1,-1])

Plotting SR w.r.t $\rho$ for different $f-$values

In [146]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

fig = plt.figure()
ax = plt.axes()

plt.plot(rho, SR_f1, color = "blue", label = "SR_f1")
plt.plot(rho, SR_f2, color = 'black', label = "SR_f2")
plt.plot(rho, SR_f3, color = 'green', label = "SR_f3")
plt.xlabel("rho")
plt.ylabel("SR")
plt.legend()
Out[146]:
<matplotlib.legend.Legend at 0x7ff7e3063a50>

Question - 4

If three fair dice are thrown,

  1. What is the probability of “getting a three”, where this is defined to mean that any subset of the dice adds to three. Thus, you win on e.g. (1,1,1),(1,2,4),(5,3,1) or (1,3,2)
  2. What is the probability of “getting a four”, where this is defined to mean that any subset of the dice adds to four. Thus you win on e.g. (1,1,2), (1,3,4), (5,4,1) or (2,3,2)

Please provide an analytic solution and a simulation that confirms your analytic calculation for both cases A and B. In simulation, please calculate expected value and standard error.

Solution:

  1. Let $E$ be the event "getting a three". Then $E$ consists of $$E = \{(1,1,1)\} \cup \{\text{at least one 3}\} \cup \{\text{1 and 2 but not 3}\} := E_1 \cup E_2 \cup E_3.$$ The cardinalities of the first two sets are $$|E_1| = 1, |E_2|=6^3 - 5^3 = 91 \ (\text{total cases} - \text{no-3 cases}).$$ For the third set, $$|E_3| = 3\times 2 \times 5 - 6 = 24.$$ This is because the 1 can be placed in any of the three positions (3 ways), the 2 in either of the two remaining positions (2 ways), and the last position can be filled by any of 1, 2, 4, 5, 6 (5 ways). However, this counts each of $(1,1,2), (1,2,1), (2,1,1), (2,2,1), (2,1,2), (1,2,2)$ twice, so we subtract 6. Therefore we have $$P(E) = \frac{1+91+24}{216} = \frac{116}{216} = 0.5370370370370371.$$

  2. With a slight abuse of notation we denote by $E$ the event of "getting a four". Then $E$ consists of the cases $$E = \{\text{at least one 4}\} \cup \{\text{two 2s but not 4}\} \cup \{\text{1 and 3 but not 4}\}.$$ As observed earlier, $$|\{\text{at least one 4}\}| = 91 \mbox{ and } |\{\text{1 and 3 but not 4}\}| = 24.$$ The cases where $2$ appears at least twice but no 4 appears are $(1, 2, 2),(2, 1, 2),(2, 2, 1),(2, 2, 2),(2, 2, 3),(2, 2, 5),(2, 2, 6),(2, 3, 2),(2, 5, 2),(2, 6, 2),(3, 2, 2),(5, 2, 2),(6, 2, 2)$. Therefore $$|\{\text{two 2s but not 4}\}| = 13.$$ This can also be counted directly: the numbers $\{1,2,3,5,6\}$ can be combined with the pair $(2,2)$ into a triplet in $15$ ways, but this counts $(2,2,2)$ three times, so $$|\{\text{two 2s but not 4}\}| = 15 - 2 = 13.$$ These three cases give $$P(E) = \frac{91+24+13}{216} = \frac{128}{216} = 0.5925925925925926.$$ Note, however, that this case split omits triples such as $(1,1,2)$ whose full sum is four, even though the question lists $(1,1,2)$ as a win; including its $3$ permutations would give $\frac{131}{216} \approx 0.6065$. The simulation below follows the same three cases and therefore reproduces $\frac{128}{216}$.
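Both counts can be cross-checked by brute force over all $6^3 = 216$ equally likely ordered rolls (a sketch in Python 3 style). It confirms $116$ for "getting a three"; for "getting a four" it returns $131$ rather than $128$, because the three permutations of $(1,1,2)$, where only the full triple sums to four, fall outside the case split above:

```python
from itertools import chain, combinations, product

def wins(draw, target):
    # does any non-empty subset of the three dice sum to `target`?
    subsets = chain.from_iterable(combinations(draw, r) for r in (1, 2, 3))
    return any(sum(s) == target for s in subsets)

outcomes = list(product(range(1, 7), repeat=3))   # all 216 ordered rolls
three = sum(wins(d, 3) for d in outcomes)
four = sum(wins(d, 4) for d in outcomes)
print(three, four)  # 116 131
```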

Simulation results

In [147]:
import random

def simulate_three_dice():
  """
  Function which rolls three dice simultaneously 
  """
  return [random.randint(1,6) for i in range(3)]

# a sample draw of the three dice
simulate_three_dice()
Out[147]:
[1, 6, 5]

Simulation of the probability of "Getting a three"

In [0]:
import itertools

def getting_a_three(nsim = 1000000):
  """
  Function to calculate the probability of the event of "getting a three"
  """
  # set a counter to 0
  count = 0.0
  # loop over the number of simulations to be done 
  for i in range(nsim):
    # make a sample draw of the three dice 
    draw = simulate_three_dice()
    # check if there is at least one 3 in draw
    if 3 in draw:
      # increment the counter 
      count = count+1
    # if 1 and 2 are present but not three
    elif (1 in draw and 2 in draw):
      # increment the counter 
      count = count +1 
    # if the event is (1,1,1) 
    elif sum(draw) == 3:
      count = count + 1
  return count/nsim
In [149]:
nsim = 10000000
tmp = getting_a_three(nsim)

print "The simulated probability is :", tmp, " with standard error of ", np.sqrt(tmp*(1-tmp)/nsim)
The simulated probability is : 0.5365748  with standard error of  0.00015769029266412058

Simulation of the probability of "Getting a four"

In [0]:
def getting_a_four(nsim = 1000000):
  # set a counter to 0
  count = 0.0
  # loop over the number of simulations to be done 
  for i in range(nsim):
    # make a roll of the three dice
    draw = simulate_three_dice()
    # if there is at least one 4 in the draw
    if 4 in draw:
      # increment the counter 
      count = count+1
    else:
      # we permute all the draws to check if events of kind (2,2,.), (2,.,2), 
      # (.,2,2) are present in the draw
      perm = itertools.permutations(draw, 2)
      # check if there is a pair of 2s, or both a 1 and a 3 in the draw
      if (tuple([2,2]) in perm) or (1 in draw and 3 in draw):
        # increment the counter by 1
        count = count + 1
  return count/nsim
In [151]:
nsim = 10000000
tmp = getting_a_four(nsim)

print "The simulated probability is :", tmp, " with standard error of ", np.sqrt(tmp*(1-tmp)/nsim)
The simulated probability is : 0.5926484  with standard error of  0.0001553757619377746

Question - 5

Download some price data, for instance from Yahoo finance for 10 stocks and 10 years.

  1. Assume the data contains missing and erroneous data points and write a function that fills the gaps sensibly and cleans the returns using winsorization.
  2. Compute a signal, for example given by the return between today’s price and the average price during the previous 20 days.
  3. Write a function that computes a covariance matrix of daily returns.
  4. Given the signal and covariance matrix, compute daily positions using Markowitz portfolio optimization.
  5. Compute the resulting profit and loss of this strategy.
  6. Write a function that, given the pnl time series, computes the maximum drawdown, its depth and length, where a drawdown lasts until a new high is reached.
  7. Plot the five deepest ones on top of each other (starting from the first day of the drawdown).

Solution

Note:

  1. Question 5.1 has been done separately, while 5.2, 5.3, 5.4 and 5.5 have been implemented within the zipline trading algorithm.
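The maximum-drawdown part of the question can also be sketched on its own, independently of the trading pipeline. This is a minimal illustrative sketch (Python 3 style; the function and the toy `pnl` list are assumptions for demonstration, not the notebook's own implementation):

```python
def max_drawdown(pnl):
    """
    Return (depth, start_index, length) of the deepest drawdown, where a
    drawdown starts at a running high and lasts until that high is reached
    again (or the series ends).
    """
    peak, peak_i = pnl[0], 0
    deepest, start = 0, 0
    for i, v in enumerate(pnl):
        if v >= peak:                 # new running high
            peak, peak_i = v, i
        elif peak - v > deepest:      # deeper trough inside a drawdown
            deepest, start = peak - v, peak_i
    # the drawdown lasts until the starting high is reached again
    end = start + 1
    while end < len(pnl) and pnl[end] < pnl[start]:
        end += 1
    return deepest, start, end - start

print(max_drawdown([0, 2, 1, 3, 1, 0, 2, 4, 3]))  # (3, 3, 4)
```

Here the series peaks at 3 (index 3), falls to 0, and only exceeds 3 again at index 7, so the deepest drawdown has depth 3 and length 4.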

Downloading data from Yahoo

In [163]:
# !pip install zipline
# !pip install fix_yahoo_finance
# !pip install yahoo-finance

import pandas
from pandas_datareader import data as pdr
import fix_yahoo_finance as yf

yf.pdr_override() 

start_date = '2007-07-22'
end_date = '2017-07-22' # 10 years of data

stocks = ['AMZN','IBM', 'GLD', 'XOM', 'AAPL','MSFT', 'TLT', 'SHY','INTC', 'GOOG']

data = pdr.get_data_yahoo(stocks, start=start_date, end=end_date)

data = data.rename(columns = {'Adj Close':'price', 'Close':'close'})

data.shape
[*********************100%***********************]  10 of 10 downloaded
Out[163]:
(2519, 60)

Check for the missing values

In [164]:
def check_missing_values(data):
  """
  Function returns the number of missing values in each column 
  for a pandas dataframe called "data"
  """
  return data.apply(lambda x: sum(x.isnull()))

check_missing_values(data)
Out[164]:
Open    AAPL    0
        AMZN    0
        GLD     0
        GOOG    0
        IBM     0
        INTC    0
        MSFT    0
        SHY     0
        TLT     0
        XOM     0
High    AAPL    0
        AMZN    0
        GLD     0
        GOOG    0
        IBM     0
        INTC    0
        MSFT    0
        SHY     0
        TLT     0
        XOM     0
Low     AAPL    0
        AMZN    0
        GLD     0
        GOOG    0
        IBM     0
        INTC    0
        MSFT    0
        SHY     0
        TLT     0
        XOM     0
close   AAPL    0
        AMZN    0
        GLD     0
        GOOG    0
        IBM     0
        INTC    0
        MSFT    0
        SHY     0
        TLT     0
        XOM     0
price   AAPL    0
        AMZN    0
        GLD     0
        GOOG    0
        IBM     0
        INTC    0
        MSFT    0
        SHY     0
        TLT     0
        XOM     0
Volume  AAPL    0
        AMZN    0
        GLD     0
        GOOG    0
        IBM     0
        INTC    0
        MSFT    0
        SHY     0
        TLT     0
        XOM     0
dtype: int64
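All counts are zero here, but task 5.1 also asks for gap filling. One sensible convention (an assumed approach, sketched in Python 3 style on a toy frame, not on the notebook's data) is to carry the last observed price forward and back-fill any gap at the very start of a series:

```python
import numpy as np
import pandas as pd

def fill_gaps(data):
    # forward-fill prices, then back-fill anything missing at the start
    return data.ffill().bfill()

prices = pd.DataFrame({
    'AAPL': [np.nan, 101.0, np.nan, 103.0],   # toy values, not real quotes
    'MSFT': [50.0, np.nan, np.nan, 53.0],
})
filled = fill_gaps(prices)
print(filled['AAPL'].tolist())  # [101.0, 101.0, 101.0, 103.0]
```

Forward-filling avoids look-ahead bias, since a gap is only ever filled with information already available on that day.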

Winsorize data

In [165]:
from scipy.stats import mstats

def winsorize_column(column, lower_lim = 0.05,upper_lim = 0.05):
  """
  Function to winsorize a column of a dataframe 
  
  Input:
  ---------------------
  column : a pandas series or column of the dataframe 
  lower_lim : lower clipping limit
  upper_lim : upper clipping limit
  ---------------------
  Output:
  ---------------------
  Returns a winsorized series 
  """
  return mstats.winsorize(column, limits=[lower_lim,upper_lim])

def winsorize_data(data):
  """
  Function which winsorizes each column of a given dataframe 
  ---------------------
  Input: 
  ---------------------
  data - a pandas dataframe 
  ---------------------
  Output: 
  ---------------------
  a winsorized dataframe 
  """
  return data.transform(winsorize_column)
  
wins_data = winsorize_data(data)
wins_data.head()
Out[165]:
Open ... Volume
AAPL AMZN GLD GOOG IBM INTC MSFT SHY TLT XOM ... AAPL AMZN GLD GOOG IBM INTC MSFT SHY TLT XOM
Date
2007-07-23 20.472857 71.779999 78.550003 257.828064 114.879997 24.620001 31.360001 82.529999 89.980003 91.930000 ... 259122500 9267800 3892400 12796000 8040900 56431300 48910600 439300 1624300 24376200
2007-07-24 19.840000 71.040001 78.550003 253.004456 115.320000 24.410000 31.010000 82.529999 89.980003 92.519997 ... 306134500 12632900 3892400 11216600 10499300 68543100 59729300 439300 1624300 31844800
2007-07-25 19.621429 84.660004 78.550003 256.819641 116.190002 24.680000 30.990000 82.529999 89.980003 91.239998 ... 306134500 12632900 7530500 11162000 11340000 50743000 54950100 439300 1624300 31459600
2007-07-26 20.844286 85.019997 78.550003 252.726257 117.010002 24.340000 30.240000 82.529999 89.980003 90.410004 ... 306134500 12632900 10704100 13856200 12141000 86328700 87025300 1740500 2866400 42007600
2007-07-27 20.884285 84.269997 78.550003 252.621933 116.620003 23.889999 29.930000 82.529999 89.980003 87.910004 ... 290274600 12632900 5341900 11089800 12141000 79410500 69214600 1063600 5367800 41077700

5 rows × 60 columns

Plot of Stock prices over time

In [166]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

def plot_stock_Data(data, column_name = 'Close'):
  """
  Function to plot the price attribute of a stock data
  
  Input:
  --------------------
  data : a stock dataframe 
  columnname : price attribute to plot default is 'Close' price
  --------------------
  Output:
  --------------------
  Plots the column_name for all stocks in the dataframe 
  """
  data[column_name].plot(lw=1)
  plt.ylabel(column_name)

plot_stock_Data(wins_data, 'price')

Function to calculate daily returns based on moving average window

In [167]:
def daily_return(data, day_window = 20):
  """
  Function calculates the daily return signal for given stock price data
  Input:
  -----------------------
  data : price data of different stocks
  day_window : the size of the rolling window over the recent past
  -----------------------
  Output:
  -----------------------
  returns of the stock data relative to the rolling mean over the
  previous `day_window` days
  """
  returns = data.apply(lambda x : (x - x.rolling(window=day_window).mean())/x.rolling(window=day_window).mean())
  return returns.dropna() 

tmp = daily_return(data)
tmp.head()
Out[167]:
Open ... Volume
AAPL AMZN GLD GOOG IBM INTC MSFT SHY TLT XOM ... AAPL AMZN GLD GOOG IBM INTC MSFT SHY TLT XOM
Date
2007-08-17 -0.090051 -0.041492 -0.023822 -0.027181 -0.012164 -0.006258 -0.045856 0.005571 -0.003891 -0.040987 ... -0.028125 -0.470796 -0.014413 0.119241 0.339592 0.310946 0.218947 -0.127452 -0.024672 0.021861
2007-08-20 -0.068789 -0.037169 -0.014220 -0.015770 -0.019005 -0.008726 -0.037601 0.005960 0.001832 -0.018184 ... -0.340456 -0.562239 -0.311153 -0.427654 -0.143797 -0.203762 -0.207292 -0.071989 -0.519259 -0.331065
2007-08-21 -0.076151 -0.048992 -0.010498 -0.021673 -0.033954 -0.003106 -0.035541 0.008919 0.002815 -0.005906 ... 0.091900 -0.334657 -0.314406 -0.217571 -0.257396 -0.187856 -0.188293 -0.227849 -0.307121 -0.312035
2007-08-22 -0.005736 0.006794 -0.008408 0.000624 -0.020748 0.006032 -0.025156 0.005705 0.001674 -0.010594 ... -0.093790 -0.314408 -0.332477 -0.277182 -0.431795 -0.283901 -0.278681 -0.331543 -0.483415 -0.345142
2007-08-23 0.013355 0.019791 -0.001400 0.011755 -0.015095 0.014069 -0.021504 0.004499 0.001933 -0.003066 ... -0.216003 -0.378052 -0.494137 -0.286087 -0.227894 -0.441513 -0.429527 -0.243564 -0.243608 -0.275050

5 rows × 60 columns
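The transformation above — the deviation of each price from its trailing moving average, relative to that average — can be sketched in plain numpy on a single hypothetical price series (the notebook applies the same formula column-wise via `DataFrame.apply`; the function name and toy data here are illustrative only):

```python
import numpy as np

def rolling_ma_return(prices, window=3):
    """Deviation of each price from its trailing moving average,
    relative to that average: (x - MA) / MA.
    The first window-1 entries are NaN, mirroring pandas' dropna()."""
    prices = np.asarray(prices, dtype=float)
    out = np.full(prices.shape, np.nan)
    for i in range(window - 1, len(prices)):
        ma = prices[i - window + 1 : i + 1].mean()  # trailing moving average
        out[i] = (prices[i] - ma) / ma
    return out

prices = [10.0, 11.0, 12.0, 12.0, 9.0]
print(rolling_ma_return(prices, window=3))
```

Dropping the first `window-1` rows is why the returned frame starts roughly a month after the price data does.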

Covariance matrix of asset returns

In [0]:
def calculate_covariance_matrix(returns):
  """
  Function calculates the covariance matrix of the daily returns 
  Input:
  ------------------
  returns : a dataframe of returns 
  ------------------
  Output:
  ------------------
  covariance matrix of returns
  """
  return returns.cov()
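For reference, on fully observed data the sample covariance computed by `cov` agrees with the plain definition $\frac{1}{n-1}\sum_t (r_t-\bar r)(r_t-\bar r)^T$. A minimal numpy sketch on a hypothetical return matrix (the toy values are illustrative only):

```python
import numpy as np

# hypothetical returns: 5 observations (rows) of 3 assets (columns)
R = np.array([[ 0.01, -0.02,  0.005],
              [ 0.02,  0.01, -0.010],
              [-0.01,  0.03,  0.020],
              [ 0.00, -0.01,  0.000],
              [ 0.03,  0.02,  0.015]])

# sample covariance from the definition: (R - mean)^T (R - mean) / (n - 1)
centered = R - R.mean(axis=0)
cov_manual = centered.T @ centered / (len(R) - 1)

# numpy's built-in estimator (rowvar=False: columns are the variables)
cov_np = np.cov(R, rowvar=False)

print(np.allclose(cov_manual, cov_np))  # True
```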

Converting the winsorized data to be used with the ZIPLINE trading algorithm

In [170]:
panel_data = wins_data.stack(0)

panel_data = panel_data.reset_index()

mul_ind_data = panel_data.set_index(['Date','level_1'])

panel = mul_ind_data.to_panel()

panel
/usr/local/lib/python2.7/dist-packages/ipykernel_launcher.py:7: DeprecationWarning: 
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

  import sys
Out[170]:
<class 'pandas.core.panel.Panel'>
Dimensions: 10 (items) x 2519 (major_axis) x 6 (minor_axis)
Items axis: AAPL to XOM
Major_axis axis: 2007-07-23 00:00:00 to 2017-07-21 00:00:00
Minor_axis axis: High to price

Markowitz Portfolio Optimization

If $x \in \mathbb{R}^n$ is the vector of portfolio weights corresponding to $n$ different stocks, we are interested in minimising a risk measure: $$ \mbox{min } x^T \Sigma x \\ \mbox{subject to } r_{avg}^T x \geq r_{min} \mbox{ and } \sum \limits_{i=1}^n x_i = 1, \mbox{ with } x \geq 0.$$ This is a quadratic optimization problem, which can be put into the standard form $$\min x^T P x + q^Tx \\ \mbox{subject to } Gx \leq h \\ \mbox{and } Ax = b,$$ where $P = \Sigma$ and $q = 0$. The inequality constraints $r_{avg}^T x \geq r_{min}$ and $x\geq 0$ correspond to $Gx \leq h$, while the equality constraint $\sum \limits_{i=1}^n x_i = 1$ corresponds to $Ax = b$. In the implementation below the return constraint is handled by sweeping the return target in the objective (tracing out the efficient frontier), so only $x \geq 0$ enters the inequality: $G = -I_n$, $h$ is a vector of zeroes, $A$ is a row vector of ones and $b$ equals 1.
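As a sanity check on this quadratic program: if we keep only the budget constraint $\sum_i x_i = 1$ (dropping the return target and the non-negativity constraint), the minimum-variance problem has the closed-form solution $x^* = \Sigma^{-1}\mathbf{1} / (\mathbf{1}^T \Sigma^{-1} \mathbf{1})$, obtained from the KKT conditions. A minimal numpy sketch on a hypothetical 3-asset covariance matrix (the numbers are illustrative only, not taken from the data above):

```python
import numpy as np

# hypothetical 3-asset covariance matrix (positive definite)
Sigma = np.array([[0.10, 0.02, 0.01],
                  [0.02, 0.08, 0.03],
                  [0.01, 0.03, 0.12]])

ones = np.ones(3)
# closed-form minimum-variance weights under the budget constraint only
w = np.linalg.solve(Sigma, ones)
w /= ones @ w

print(w)                 # optimal weights
print(w.sum())           # equals 1 up to floating-point error
print(w @ Sigma @ w)     # minimised portfolio variance
```

Any budget-preserving perturbation of `w` can only increase the variance, which is a quick way to convince oneself the solver output below is sensible.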

In [0]:
from cvxopt import blas, solvers
import cvxopt as opt
from cvxopt.solvers import qp
import numpy as np

def calculate_optimal_weights(returns):
    """
    Function to calculate the optimal portfolio based on returns supplied by
    the user. Function uses Markowitz portfolio optimization to obtain 
    the optimal portfolio. 
    
    Input:
        returns : returns corresponding to different stocks 
    output:
        optimal weights for the different stocks, summing to 1.
    Method:
        If $x \in R^n$ is the vector of portfolio weights corresponding
        to n different stocks, we are interested in minimising
        the risk measure $x^T \Sigma x$ subject to $r_{avg}^T x \geq r_{min}$
        and $\sum \limits_{i=1}^n x_i = 1$ with $x \geq 0$. This is a 
        quadratic optimization problem which can be put into the standard 
        form $$\min x^T P x + q^Tx \\ \mbox{subject to } Gx \leq h \\
        \mbox{and } Ax = b$$, where $P = \Sigma$ and $q = 0$. The inequality
        constraints $r_{avg}^T x \geq r_{min}$ and $x \geq 0$ correspond to 
        $Gx \leq h$, while the equality constraint 
        $\sum \limits_{i=1}^n x_i = 1$ corresponds to $Ax = b$. 
    """
    n = len(returns)
    returns = np.asmatrix(returns)
    # initialise a vector of return targets for tracing the efficient frontier
    r_min = 0.10 # this is the minimum return threshold
    mu_vector = [r_min+0.001*i for i in range(100)]
    
    # We convert the covariance matrix of returns to a cvxopt matrix
    Sigma = opt.matrix(np.cov(returns))
    r_avg = opt.matrix(np.mean(returns, axis=1))
    
    # we initialize all the constraint matrices 
    G = -opt.matrix(np.eye(n))   
    h = opt.matrix(0.0, (n ,1))
    A = opt.matrix(1.0, (1, n))
    b = opt.matrix(1.0)

    # suppress the progress output of the solver 
    opt.solvers.options['show_progress'] = False
    # Calculate efficient frontier weights using quadratic programming
    portfolios = [qp(Sigma, -mu*r_avg, G, h, A, b)['x'] for mu in mu_vector]
    # we calculate portfolio returns and risks (standard deviations)
    portfolio_returns = [blas.dot(r_avg, x) for x in portfolios]
    portfolio_stdvs = [np.sqrt(blas.dot(x, Sigma*x)) for x in portfolios]
    # we fit a second-degree polynomial to the frontier (returns vs standard
    # deviations) in order to locate the optimal risk/return trade-off;
    # note that np.polyfit emits a RankWarning (a warning, not an exception)
    # for an ill-conditioned fit, so no try/except is needed here
    poly_fit = np.polyfit(portfolio_returns, portfolio_stdvs, 2)
    x_opt = np.sqrt(poly_fit[2] / poly_fit[0])
    # Finally calculate the optimal weight allocated to each stock
    weights = qp(opt.matrix(x_opt * Sigma), -r_avg, G, h, A, b)['x']
    return np.asarray(weights)

Running Trading Algorithm using Zipline

In [0]:
from cvxopt import blas, solvers
import zipline
from zipline.api import history,set_slippage,slippage, set_commission
from zipline.api import order_target_percent,commission

from zipline import TradingAlgorithm

def initialize(context):
    '''
    Initialize function as per the zipline Trading algorithm 
    
    Input:
    -----------
    Context : see zipline reference manual
    '''
    # We turn off slippage, commission and trading costs, as the Markowitz 
    # portfolio optimization model does not account for such costs
    set_slippage(slippage.FixedSlippage(spread=0.0))
    set_commission(commission.PerShare(cost=0.00, min_trade_cost=0.0))
    context.i = 0
    
def handle_data(context, data):
    '''
    Function is called within TradingAlgorithm function from zipline
    
    Input :
    -----------------
    context : as defined earlier 
    data    : the stock data 
    '''
    # We let the data accumulate with no trading until context.i reaches 100, 
    # and then use the information accumulated so far to drive the trading 
    # algorithm 
    context.i += 1
    if context.i < 100:
        return
    # We trade based on the prices of past 3+ months 
    prices = history(100, '1d', 'price').dropna()
    # we calculate the percentage change between the moving average window of past 20 days 
    # and the current price 
    returns = prices.apply(lambda x: (x-x.rolling(window=20).mean())/x.rolling(window=20).mean()).dropna()
    try:
      # Perform Markowitz-style portfolio optimization
      weights= calculate_optimal_weights(returns.T)
      # Place an order to adjust a position to a target value. If the position 
      # doesn’t already exist, this is equivalent to placing a new order. 
      # If the position does exist, this is equivalent to placing an order for the 
      # difference between the target value and the current value.
      # citation -- zipline reference manual
      for stock, weight in zip(prices.columns, weights):
        order_target_percent(stock, weight)
    except ValueError:
      pass
        
# Our trading algorithm         
Trading_algo = TradingAlgorithm(initialize=initialize, handle_data=handle_data)
# Run the algorithm using the `panel' data created earlier 
results = Trading_algo.run(panel)

Plot of the Portfolio value if we start with $100K

In [173]:
results.portfolio_value.plot()
Out[173]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff7e2f05a90>

Resulting PnL of this trading algorithm

In [174]:
results.pnl.cumsum().plot()
plt.ylabel('Profit in $')
plt.xlabel('Time')
Out[174]:
Text(0.5,0,u'Time')

Extract Maximum Draw Down and plotting

In [0]:
def max_drawdown_plot(pnl_df, plot = True):
    '''
    Function to compute the drawdowns of a given PnL timeseries and,
    optionally, plot the cumulative PnL against its running maximum.
    
    Input:
        pnl_df : a daily PnL data frame 
        plot   : bool to plot the cumulative PnL and its running maximum 
    Output:
        plot : cumulative PnL and running maximum of the given PnL series 
        drawdown depths and their periods : a list of lists in the format
        [depth of drawdown, number of running days]
    '''
    pnl_df['Total Profit/Loss'] = pnl_df['pnl'].cumsum()
    pnl_df['Running_Max'] = pnl_df['Total Profit/Loss'].cummax()
    n = len(pnl_df)
    # a container to store each drawdown depth and its running period in days
    dd_period = []
    # set the initial period equal to 1
    period = 1
    # set the depth to the first value of the cumulative PnL  
    depth = pnl_df['Total Profit/Loss'][0]
    # loop over the entire data set and store the period and depth of each drawdown
    for i in range(1, n-1):
      if pnl_df.Running_Max[i] == pnl_df.Running_Max[i+1]:
        period += 1
        depth = min(pnl_df['Total Profit/Loss'][i], depth)
      else:
        dd_period.append([pnl_df['Running_Max'][i]-depth, period])
        period = 1
        depth = pnl_df['Total Profit/Loss'][i]
    # if plot is true then we plot the cumulative PnL and its running maximum     
    if plot:
      pnl_df[['Total Profit/Loss', 'Running_Max']].plot()
      plt.ylabel('Profit in $')
      plt.xlabel('Time')
      plt.show()
    return dd_period
In [176]:
import matplotlib.pyplot as plt

tmp = max_drawdown_plot(pd.DataFrame(results.pnl))

print 'Drawdowns with their respective periods are : \n \n ', tmp
Drawdowns with their respective periods are : 
 
  [[0.0, 100], [4637.401048, 6], [3510.800523999991, 1], [2562.358428000007, 1], [78.60157199998503, 2], [56956.073294000016, 665], [1607.184930000003, 3], [1676.203944000008, 2], [3027.020000000004, 3], [7197.796055999992, 17], [5255.953330000004, 7], [5229.277331999983, 8], [6076.369999999981, 4], [4202.102000999992, 1], [2774.7226680000167, 2], [4915.786664999992, 8], [1227.2873370000016, 1], [813.7406670000055, 2], [10371.852000999992, 10], [6036.352001000007, 2], [3701.8520010000066, 15], [3808.56399699999, 1], [973.8246690000087, 1], [3174.916664999997, 8], [526.9259980000061, 1], [2227.777331999998, 7], [2147.740666999991, 1], [26128.609346000056, 290], [3719.1048509999964, 1], [4961.882540999999, 3], [3587.434158000018, 1], [1529.2114530000254, 4], [2524.8226080000168, 1], [1732.5087779999885, 1], [1792.5641580000229, 3], [4407.481155000016, 1], [2469.382145999989, 2], [2164.4676900000195, 1], [24511.406072999976, 72], [7272.7563940000255, 5], [9748.326836999971, 1], [1041.6065099999832, 1], [4338.198085999989, 12], [2647.3332050000026, 4], [2339.6557690000045, 2], [2910.1451279999746, 2], [1403.7919230000116, 1], [647.4061539999966, 2], [1314.051923000021, 3], [1421.1205379999883, 1], [4670.411075999989, 4], [3103.4173949999968, 1], [1952.8726050000114, 1], [1999.9306890000007, 1], [501.1541009999928, 1], [45533.229272000055, 544], [2141.5643409999902, 2], [2373.0800000000163, 1], [231.5243409999821, 1], [3125.514212000009, 2], [18449.25, 58], [9930.052664999937, 39], [6207.755340000003, 1], [743.142434999987, 3], [24523.957120000006, 41], [7249.0402100000065, 13], [5366.698665000033, 1], [4450.0, 2], [6359.054005000042, 1], [3350.8495549999643, 1], [2265.048219999997, 2], [120.14510499997414, 2], [21021.79555000001, 1], [2394.0750800000096, 3], [3453.20444999999, 4], [32769.80667499997, 30], [5108.591099999991, 1], [618.5566749999707, 1], [23282.403560000006, 15], [4623.556229999987, 3], [7836.443324999971, 1], [3702.403560000006, 1], [7734.101779999997, 
4], [15628.424919999961, 1], [4263.080420000013, 1], [1068.0111249999609, 1], [2710.0348700000322, 1], [4205.255339999974, 2], [1090.2286399999866, 2], [6959.806229999987, 1], [6541.505339999974, 1], [1655.3870950000128, 2], [1864.5513349999674, 1], [13750.510680000007, 8], [4690.2902100000065, 5], [17314.93843000004, 19]]
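A compact alternative formulation: the maximum drawdown of a cumulative PnL series is the largest gap between the running peak and the current value, which reduces to two vectorized numpy operations. A sketch on a hypothetical PnL array, independent of the dataframe machinery above (the function name and toy numbers are illustrative only):

```python
import numpy as np

def max_drawdown(cum_pnl):
    """Largest peak-to-trough decline of a cumulative PnL series."""
    cum_pnl = np.asarray(cum_pnl, dtype=float)
    running_peak = np.maximum.accumulate(cum_pnl)  # peak so far at each step
    drawdowns = running_peak - cum_pnl             # distance below the peak
    return drawdowns.max()

# hypothetical cumulative PnL: peak at 120, trough at 70 -> drawdown 50
print(max_drawdown([0, 50, 120, 90, 70, 130, 110]))  # 50.0
```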

Question - 6

Shopkeeper Mr. Right observed that his customers arrive randomly into his shop, with an average of $\lambda = 0.5$ persons per minute. Mr. Right will close his shop when $T = 10$ minutes have passed since the last customer arrived. Show that Mr. Right keeps the shop open for approximately $5$ hours each day. Provide the general solution for arbitrary $\lambda$ and $T$. Hint: use the following simple model for customer flow: each $\delta$ instant of time, Hermes, the god of shopkeepers, throws an asymmetric coin with Tail probability $p$. If the coin lands on Tails, a customer appears; if it lands on Heads, nothing happens at Mr. Right's shop. Assume $p = \delta \lambda$. What is the coin event corresponding to a 10-minute window with no customers at all? It is an event of a number of consecutive Heads. How many consecutive Heads? How long on average does Mr. Right need to wait for this coin event?

Solution

Since customers arrive at rate $\lambda$, the waiting time between two consecutive customers is an exponential random variable $X$ with rate parameter $\lambda$, i.e. $X \sim \mbox{Exp}(\lambda)$. We are interested in the first time that a window of $T$ minutes passes with no customer arriving. Let $N:=\min \{n : X_n > T\}$, where the $X_n$ are i.i.d. random variables distributed as $\mbox{Exp}(\lambda)$. We note that $$\mathbb{P} (N = n) = \mathbb{P}(X_i < T \mbox{ for } i = 1,..,n-1 \mbox{ and } X_n > T).$$ Since the $X_i$ are i.i.d. and $\mathbb{P}(X > T) = e^{-\lambda T}$, we get $$\mathbb{P} (N = n) =\left(1-e^{-\lambda T}\right)^{n-1} e^{-\lambda T},$$ hence $N \sim \mbox{Geometric}(e^{-\lambda T})$ with mean $e^{\lambda T}$. Now let $Y$ denote the random number of minutes between the start of the day and the closing of the shop; then $$Y = \sum \limits_{i=1}^{N-1} \left(X_i \mid X_i < T\right) + T.$$ By Wald's identity, the average number of minutes between the start of the day and the closing of the shop is $$\mathbb{E}(Y) = \mathbb{E}(N-1)\,\mathbb{E}(X_i | X_i < T) + T.$$ The conditional variable $X_i | X_i < T$ has density $\lambda e^{-\lambda x}/(1-e^{-\lambda T})$ on $[0,T]$, which gives $$\mathbb{E}(X_i | X_i < T) = \frac{\frac{1}{\lambda} - \left(T + \frac{1}{\lambda}\right) e^{-\lambda T}}{1-e^{-\lambda T}}.$$ Since $\mathbb{E}(N-1) = e^{\lambda T}-1$ and $\left(e^{\lambda T}-1\right)/\left(1-e^{-\lambda T}\right) = e^{\lambda T}$, the expression collapses to $$\mathbb{E}(Y) = e^{\lambda T}\left(\frac{1}{\lambda} - \left(T + \frac{1}{\lambda}\right) e^{-\lambda T}\right) + T = \frac{e^{\lambda T}-1}{\lambda}.$$ Therefore the average number of hours the shop stays open is $$\frac{1}{60}\mathbb{E}(Y)= \frac{e^{\lambda T}-1}{60\,\lambda},$$ which for $\lambda = 0.5$ and $T = 10$ equals $(e^{5}-1)/30 \approx 4.91$ hours, i.e. approximately $5$ hours, in agreement with the simulation below.
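The expectation calculation can be cross-checked numerically with a short numpy sketch: compute $\mathbb{E}(X \mid X < T)$ by quadrature of the truncated exponential density (note it must be normalised by $\mathbb{P}(X<T)=1-e^{-\lambda T}$), plug it into $\mathbb{E}(N-1)\,\mathbb{E}(X \mid X < T) + T$, and compare with the closed form $(e^{\lambda T}-1)/\lambda$:

```python
import numpy as np

lam, T = 0.5, 10.0

# E(X | X < T): trapezoidal quadrature of the truncated exponential density
x = np.linspace(0.0, T, 200001)
f = x * lam * np.exp(-lam * x) / (1.0 - np.exp(-lam * T))
dx = x[1] - x[0]
e_trunc = dx * (f.sum() - 0.5 * (f[0] + f[-1]))

# E(N - 1) for N ~ Geometric(exp(-lam*T)), which has mean exp(lam*T)
e_nm1 = np.exp(lam * T) - 1.0

lhs = e_nm1 * e_trunc + T               # Wald's identity expression
rhs = (np.exp(lam * T) - 1.0) / lam     # closed form for E(Y)

print("lhs = %.4f hours, rhs = %.4f hours" % (lhs / 60.0, rhs / 60.0))
```

Both expressions come out near 4.91 hours for $\lambda = 0.5$, $T = 10$.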

Simulation results

In [177]:
import random
import numpy as np

def average_elapsed_hours(Lambda = 0.5, T = 10):
  """
  Function calculates the theoretical average number of hours the shop stays 
  open, using the closed form derived above: (exp(Lambda*T) - 1)/(60*Lambda)
  """
  return (np.exp(Lambda*T)-1)/(60*Lambda)

print "The theoretical average number of hours is: {:.4f}".format(average_elapsed_hours()), "\n"

def shop_close_time(Lambda = 0.5, T = 10):
  """
  Function returns the shop closing time (in hours) for a given value of Lambda and T.
  """
  # set the beginning of time to 0
  t = 0.0
  # simulate waiting times until a waiting time greater than T is observed
  while True:
    wait_time = random.expovariate(Lambda)
    if wait_time > T:
      break
    # advance the clock by the waiting time
    t += wait_time
  # the shop closes T minutes after the last arrival
  return (t+T)/60
  
def average_shop_close_time(nsim = 100000, Lambda = 0.5, T = 10):
  """
  Function simulates and prints the average shop closing time in hours together 
  with its standard error. 
  """
  # simulate nsim days of shop closing times (in hours)
  x = [shop_close_time(Lambda, T) for _ in range(nsim)]
  print "Simulated average shop closing time is:", np.mean(x), "hours with se :", np.std(x)/np.sqrt(nsim), "hours"
  
average_shop_close_time(100000)
The theoretical average number of hours is: 4.9138 

Simulated average shop closing time is: 4.915427189857912 hours with se : 0.015119837773014238 hours

Question - 7

BONUS question: You keep throwing a fair die with scores 1, 2, ... 6 on its faces and you accumulate the scores into a variable X. Calculate a very accurate estimate of the probability that X hits exactly 10000. Use simulation to gain an intuition and demonstrate the experiment.

Simulation results

Function to throw a dice until the sum exceeds a given threshold

In [0]:
import random
def sum_of_dice_values(threshold = 10000):
  """
  Function rolls a dice until the sum is greater than or equal to 
  a given threshold.
  ------------------------
  Input: a threshold 
  ------------------------
  Output : 
  ------------------------
  [sum, number of draws to each sum] -- in a list format.
  """
  # Set the sum equal to zero
  x = 0
  # set the counter to zero as well 
  counter = 0
  # iterate over until the sum exceeds the threshold
  while True:
    # roll and dice and calculate the sum
    x = x + random.randint(1,6)
    # increment the counter 
    counter = counter + 1
    # if the sum exceeds then eject out of the loop
    if x >= threshold:
      break
    else:
      pass
  return [x, counter]

Function to simulate the number of draws needed to exactly match a given threshold

In [0]:
def get_a_match_to_threshold(threshold = 10000):
  """
  Function to find the number of draws needed to exactly match a given threshold 
  -----------------------
  Input : a threshold 
  -----------------------
  Output : the number of draws 
  """
  counter = 0
  while True:
    tmp = sum_of_dice_values(threshold)
    counter = counter + 1
    if tmp[0] == threshold:
      break
  return tmp[1]
In [180]:
get_a_match_to_threshold()
Out[180]:
2858

Function to calculate the probability of exactly hitting a given threshold given that the sum is $\geq$ threshold. (This probability can be ignored)

In [181]:
import numpy as np

def cal_prob_hitting_threshold(nsim = 10000, threshold = 10000):
  """
  Function calculates the probability of hitting a given threshold 
  given that the sum is >= the threshold. 
  
  Input:
  -----------------
  nsim : number of simulations sought with default value 10000.
  threshold : a threshold value with default equal to 10000.
  
  Output:
  -----------------
  prob : probability of hitting the threshold 
  all trials : a list of number of rolls required to exactly match the threshold. 
  """
  # set the counter to zero 
  counter = 0.0
  # a container to store the number of rolls for each exact match 
  all_trials = []
  # loop over the number of simulations 
  for i in range(nsim):
    # run a trial until the sum is >= threshold 
    trial = sum_of_dice_values(threshold)
    # check if we have matched the given threshold 
    if trial[0] == threshold: 
      # increment the counter 
      counter += 1
      # store the number of rolls when the threshold is matched exactly 
      all_trials.append(trial[1])
  return [counter/nsim, all_trials]


nsim = 20000
tmp, trials = cal_prob_hitting_threshold(nsim)
print "The simulated probability that X=10K | X >= 10K is :", tmp, " with standard error of ", np.sqrt(tmp*(1-tmp)/nsim)
The simulated probability that X=10K | X >= 10K is : 0.27785  with standard error of  0.003167407279621615

Visualisation for the number of draws needed to exactly hit a given threshold

In [183]:
import pandas as pd
from scipy import stats
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

trials_df = pd.DataFrame(trials)
ax = trials_df.plot(kind = 'hist',normed=True, bins = 20, label = 'Histogram')
trials_df.plot(kind="density", ax = ax, label = 'Density', lw=3)

xt = plt.xticks()[0]  
xmin, xmax = min(xt), max(xt)
lnspc = np.linspace(xmin, xmax, len(trials))

# let's fit the normal distribution 
m, s = stats.norm.fit(trials) # get mean and standard deviation  
pdf_norm = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval  
plt.plot(lnspc, pdf_norm, label="Normal Fit", lw=3) # plot it


plt.xlabel("Number of draws")
plt.ylabel("Density")
plt.title("Density of number of draws required to hit 10K")
handles, labels = ax.get_legend_handles_labels()
ax.legend(['Density', 'Normal Fit','Histogram' ])
plt.show()
In [184]:
from scipy import stats

def test_for_normality(data, alpha = 5e-2):
  _, p = stats.normaltest(data)
  m, s = stats.norm.fit(data) # get mean and standard deviation
  print("The p-value of normality test = {:g}".format(p))
  if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The given data does not follow a normal distribution")
  else:
    print("The given data follows a normal distribution with mean = {:g} and standard deviation = {:g}".format(m,s))

test_for_normality(trials)
The p-value of normality test = 0.0336086
The given data does not follow a normal distribution

A theoretical solution based on simulation

We were interested in the distribution of the random variable $N$ such that the sum of the die values is $X = \sum \limits_{i=1}^{N} X_i = 10000$, where $X_i$ is the score of the $i$-th roll. The minimum value that $N$ can assume is $1667$, when we observe $1666$ sixes and one $4$ in any order, while the maximum value of $N$ is $10000$, when we observe $10000$ ones in a row. From the above histogram plot we can observe that $N$ behaves approximately like a normal random variable, i.e. $$N \sim N(2857.75, 26.7054^2) \mbox{ approximately}.$$ Upon continuity correction, we can write the mass function of the random variable $N$ as $$\mathbb{P}(N = n) \approx \mathbb{P}(n-0.5 \leq N(2857.75, 26.7054^2) \leq n+0.5) \\ = \mathbb{P}(n-0.5-2857.75 \leq N(0, 26.7054^2) \leq n+0.5-2857.75) \\ = \mathbb{P}\left(\frac{n-0.5-2857.75}{26.7054} \leq N(0, 1) \leq \frac{n+0.5-2857.75}{26.7054}\right) \\ = \Phi\left( \frac{n+0.5-2857.75}{26.7054} \right) - \Phi\left( \frac{n-0.5-2857.75}{26.7054} \right),$$ where $\Phi(\cdot)$ is the distribution function of a standard normal random variable. Recall that $n$ ranges over $1667, \dots, 10000$.
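The normal approximation above describes $N$, the number of rolls at the moment of hitting. The hitting probability asked for in the question can also be computed exactly: let $p(n)$ be the probability that the running sum ever equals $n$; conditioning on the first roll gives the recursion $p(n) = \frac{1}{6}\sum_{k=1}^{6} p(n-k)$ with $p(0)=1$ and $p(n)=0$ for $n<0$. By renewal theory $p(n) \to 1/\mathbb{E}(\mbox{single roll}) = 1/3.5 = 2/7 \approx 0.2857$, and the recursion reaches this limit long before $n = 10000$. A short sketch (the function name is illustrative):

```python
def hit_prob(target):
    """Probability that the running sum of fair-die rolls ever equals target.
    Recursion: p(n) = (1/6) * sum_{k=1}^{6} p(n-k), p(0) = 1, p(n<0) = 0."""
    p = [0.0] * (target + 1)
    p[0] = 1.0
    for n in range(1, target + 1):
        p[n] = sum(p[n - k] for k in range(1, 7) if n - k >= 0) / 6.0
    return p[target]

print(hit_prob(1))      # 1/6: only a first roll of 1 hits the sum 1
print(hit_prob(10000))  # converged to 2/7 ~ 0.2857
```

This exact value, $2/7$, is consistent with the Monte Carlo estimate obtained earlier.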